remove BOM from string read from utf-8 file

Achim Domma

Hi,

I read some text from a utf-8 encoded text file like this:

text = codecs.open('example.txt','r','utf8').read()

If I pass this text to a COM object, I can see that there is still the BOM
in the file, which marks the file as utf-8. Simply removing the first
character in the string is not ok, because the BOM is optional. So I tried
something like this:

if text.startswith(codecs.BOM_UTF8):
print "found BOM"

but then I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)

What's the right way to remove the BOM from the string?

regards,
Achim

Jul 18 '05 #1

Subscribe Post Reply

28215

Piet van Oostrum

>>>>> "Achim Domma" <do***@procoders.net> (AD) wrote:

AD> Hi,
AD> I read some text from a utf-8 encoded text file like this:

AD> text = codecs.open('example.txt','r','utf8').read()

AD> If I pass this text to a COM object, I can see that there is still the BOM
AD> in the file, which marks the file as utf-8. Simply removing the first
AD> character in the string is not ok, because the BOM is optional. So I tried
AD> something like this:

The BOM is in the file, but not in the string 'text'
text is a unicode string which consists of Unicode characters and the BOM
is not a Unicode character.

Check text[0] and len(text) to verify.

Moreover BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string, that
is the reason for the complaint.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.***********@hccnet.nl

Jul 18 '05 #2

Achim Domma

"Piet van Oostrum" <pi**@cs.uu.nl> wrote in message
news:wz************@Ordesa.local...

Check text[0] and len(text) to verify.

That's what I did. The file contains 24 chinese characters and len(text) is
25. And 0xef is the hex code for the BOM if I'm not completely wrong.

Achim

Jul 18 '05 #3

Matt Gerrans

I found myself often needing to read text files that might be utf-8, unicode
or ansi, without knowing beforehand which, so I wrote a single function to
do it. I don't know if this is the correct way to handle this situation,
but I couldn't find any function that would simply open a file with the
appropriate codec automatically, so I use this (it doesn't handle all cases,
but just the ones I've needed so far):

import os, codecs
#---------------------------------------------------------------------------
-
# OpenTextFile()
#
# Opens a file correctly whether it is unicode or ansi. If the file
# doesn't exist, then the default encoding is unicode (UTF-16).
#
# Python documentation of the codecs module is pretty weak; for instance
# there are all these:
# BOM
# BOM_BE
# BOM_LE
# BOM_UTF8
# BOM_UTF16
# BOM_UTF16_BE
# BOM_UTF16_LE
# BOM_UTF32
# BOM_UTF32_BE
# BOM_UTF32_LE
# but no explanation of how they map to the encodings like 'utf-16'. Some
# can be inferred, but some are not so clear.
#---------------------------------------------------------------------------
-
def OpenTextFile(filename,mode='r',encoding=None):
if os.path.isfile(filename):
f = file(filename,'rb')
header = f.read(4) # Read just the first four bytes.
f.close()
# Don't change this to a map, because it is ordered!!!
encodings = [ ( codecs.BOM_UTF32, 'utf-32' ),
( codecs.BOM_UTF16, 'utf-16' ),
( codecs.BOM_UTF8, 'utf-8' ) ]
for h,e in encodings:
if header.find(h) == 0:
encoding = e
break
return codecs.open(filename,mode,encoding)

Jul 18 '05 #4

Piet van Oostrum

>>>>> "Achim Domma" <do***@procoders.net> (AD) wrote:

AD> "Piet van Oostrum" <pi**@cs.uu.nl> wrote in message
AD> news:wz************@Ordesa.local...

Check text[0] and len(text) to verify.

AD> That's what I did. The file contains 24 chinese characters and len(text) is
AD> 25. And 0xef is the hex code for the BOM if I'm not completely wrong.

Sorry, I was wrong.
You have to check for text.startswith(u'\ufeff')

--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.***********@hccnet.nl

Jul 18 '05 #5

Similar topics

Multibyte string length

by: Zygmunt Krynicki | last post by:

Hello I've browsed the FAQ but apparently it lacks any questions concenring wide character strings. I'd like to calculate the length of a multibyte string without converting the whole string. ...

C / C++

query string encoding/decoding

by: Mark | last post by:

I've run a few simple tests looking at how query string encoding/decoding gets handled in asp.net, and it seems like the situation is even messier than it was in asp... Can't say I think much of the...

ASP.NET

How to remove HTTP Header [Expect: 100-continue]

by: Christian Lutz | last post by:

Hy there I have a Web Services written in Java, running on Tomcat. The Client is written in C#. When i monitor the request/Response with TCPMon (included in Tomcat) i can observer the following...

.NET Framework

byte count unicode string

by: willie | last post by:

>willie wrote: wrote:

Python

Is there a way to get utf-8 out of a Unicode string?

by: thebjorn | last post by:

I've got a database (ms sqlserver) that's (way) out of my control, where someone has stored utf-8 encoded Unicode data in regular varchar fields, so that e.g. the string 'Blåbærsyltetøy' is in the...

Python

How to remove accents (A-Umlaut to A)

by: cody | last post by:

Is there a method to replace special characters like Ä (A-Umlaut) with A, Ö (O-Umlaut) with O, and so on? Sure, I could look for each character separately and replace it with its...

.NET Framework

newbie: how do I test a byte string?

by: Slippy27 | last post by:

How do I test a byte string in Python? I want to manually convert (no libraries or functions) a UTF-8 string into UTF-16. My basic solution is to read from the stream some number of UTF-8 bytes,...

Python

How to print first(national) char from unicode string encoded inutf-8?

by: sniipe | last post by:

Hi, I have a problem with unicode string in Pylons templates(Mako). I will print first char from my string encoded in UTF-8 and urllib.quote(), for example string '£ukasz': ...

Python

how to remove webservice headers

by: GADOI | last post by:

Hello: I have a web swervice that returns a formatted KML (Google earth) xml document. Google Earth says it can consume it IF the top level element starts with : <kml...

.NET Framework

How to obtain a utf-8 string from an XmlReader?

by: SammyBar | last post by:

Hi all, I'm trying to convert the xml obtained from a XmlReader object into a UTF-8 array. My general idea is to read the XmlReader and write into a MemoryStream. Then convert the MemoryStream...

.NET Framework

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA